Skip to content

Improve exception catching in UCX communication#7132

Merged
quasiben merged 5 commits into
dask:mainfrom
pentschev:ucx-base-exception-read-frames
Oct 12, 2022
Merged

Improve exception catching in UCX communication#7132
quasiben merged 5 commits into
dask:mainfrom
pentschev:ucx-base-exception-read-frames

Conversation

@pentschev

Copy link
Copy Markdown
Member

Closes #7130

As was previously discussed in #6996, some of the exception catching could be improved for both read and write tasks which are now addressed by the changes contained here.

  • Tests added / passed
  • Passes pre-commit run --all-files

@pentschev

Copy link
Copy Markdown
Member Author

cc @wence-

@github-actions

github-actions Bot commented Oct 10, 2022

Copy link
Copy Markdown
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   6h 51m 25s ⏱️ + 21m 25s
  3 146 tests ±0    3 061 ✔️ +1    85 💤 ±0  0  - 1 
23 272 runs  ±0  22 357 ✔️ +3  915 💤  - 2  0  - 1 

Results for commit 61e5af0. ± Comparison against base commit 37fd8b5.

♻️ This comment has been updated with latest results.

@wence-

wence- commented Oct 10, 2022

Copy link
Copy Markdown
Contributor

Lints fails are addressed by #7131

@wence- wence- left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor typo, but otherwise looks good, thanks!

Comment thread distributed/comm/ucx.py Outdated
@wence-

wence- commented Oct 11, 2022

Copy link
Copy Markdown
Contributor

I think if you merge trunk (or rebase) the linting errors will disappear

@quasiben

Copy link
Copy Markdown
Member

The test_shutdown_localcluster failure is a known flaky tests:
#6625 (comment)

The test_sni failure is also unrelated but I've not seen it before. I'm going to restart CI but it would be good to keep an eye on it:

___________________________________ test_sni ___________________________________

loop = <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fa96cb75730>

    def test_sni(loop):
        port = open_port()
        with popen(["dask-scheduler", "--no-dashboard", f"--port={port}"] + tls_args) as s:
            with popen(
                [
                    "dask-worker",
                    "--no-dashboard",
                    "--scheduler-sni",
                    "localhost",
                    f"tls://127.0.0.1:{port}",
                ]
                + tls_args
            ) as w:
                with Client(
                    f"tls://127.0.0.1:{port}", loop=loop, security=tls_security()
                ) as c:
>                   wait_for_cores(c)

distributed/cli/tests/test_tls_cli.py:60: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

c = <Client: 'tls://10.213.3.122:50230' processes=0 threads=0, memory=0 B>
nthreads = 1

    def wait_for_cores(c, nthreads=1):
        start = time()
        while len(c.nthreads()) < 1:
            sleep(0.1)
>           assert time() < start + 10
E           assert 1665511027.440665 < (1665511017.4086459 + 10)
E            +  where 1665511027.440665 = time()

distributed/cli/tests/test_tls_cli.py:30: AssertionError

@wence-

wence- commented Oct 12, 2022

Copy link
Copy Markdown
Contributor
>           assert time() < start + 10
E           assert 1665511027.440665 < (1665511017.4086459 + 10)
E            +  where 1665511027.440665 = time()

This seems like the kind of test that is very likely to fail on heavily loaded virtualised systems.

@quasiben

Copy link
Copy Markdown
Member

All tests pass now -- merging in.

@wence- if you have thoughts on improving either of the flaky tests they would be extremely appreciated

@quasiben quasiben merged commit 470f42c into dask:main Oct 12, 2022
@pentschev pentschev deleted the ucx-base-exception-read-frames branch October 17, 2022 09:57
gjoseph92 pushed a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Catch exceptions on both send and recv in UCX comm

3 participants